9 research outputs found

    Parallel algorithms for lattice QCD

    SIGLE record: available from the British Library Document Supply Centre (BLDSC), DSC:D81971, United Kingdom

    An In-Depth Analysis of the Slingshot Interconnect

    The interconnect is one of the most critical components in large-scale computing systems, and its impact on application performance will only grow with system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow exascale and hyperscale datacenter networks to be built with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, as well as highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which keeps it interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot delivers these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion than on previous-generation networks. Comment: To be published in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '20), 2020.
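
    The congestion claim can be made concrete with a small victim/aggressor experiment. The sketch below is a minimal MPI microbenchmark of that kind, not the benchmark suite used in the paper: two "victim" ranks measure small-message latency while the remaining ranks load the fabric with large all-to-alls. All names, message sizes, and iteration counts are illustrative.

```c
/*
 * Hedged sketch of a victim/aggressor congestion microbenchmark (illustrative
 * only; not the benchmark suite used in the paper).  Run with at least 4 ranks:
 * ranks 0 and 1 ping-pong small messages, all other ranks run large all-to-alls
 * on a separate communicator to load the fabric.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITERS 1000
#define MSG   8          /* victim message size in bytes (illustrative) */
#define BULK  (1 << 20)  /* aggressor per-peer message size in bytes (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int victim = (rank < 2);                 /* ranks 0 and 1 measure latency */
    MPI_Comm sub;                            /* aggressors get their own comm */
    MPI_Comm_split(MPI_COMM_WORLD, victim, rank, &sub);

    MPI_Barrier(MPI_COMM_WORLD);             /* start both groups together */

    if (victim) {
        char buf[MSG] = {0};
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("victim one-way latency under load: %.2f us\n",
                   (MPI_Wtime() - t0) / (2.0 * ITERS) * 1e6);
    } else {
        int n;
        MPI_Comm_size(sub, &n);
        char *snd = malloc((size_t)n * BULK);
        char *rcv = malloc((size_t)n * BULK);
        memset(snd, 1, (size_t)n * BULK);
        for (int i = 0; i < 20; i++)          /* sustained background traffic */
            MPI_Alltoall(snd, BULK, MPI_CHAR, rcv, BULK, MPI_CHAR, sub);
        free(snd);
        free(rcv);
    }

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```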

    Optimised Collectives on QsNet II

    In this paper we present an in-depth description of how QsNet II supports collectives. Performance data from jobs run on 256-1024 node clusters show that the time to complete barrier synchronization is as low as 5 microseconds, with very good scalability. Results for broadcast indicate that QsNet II can deliver data to 512 nodes in 8-10 microseconds and can sustain an asymptotic bandwidth in excess of 800 Mbytes/sec to all nodes. A 512-node reduction (64-bit floating-point sum) completes in 18 microseconds. A gather of a single word from 1024 nodes completes in 30 microseconds. The all-to-all collective delivers 350 Mbytes/sec/node (85% of peak) across 512-1024 nodes.
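
    For readers who want to reproduce this style of measurement on their own system, the sketch below times a barrier and an 8-byte floating-point sum reduction at the MPI level. It is only an approximation of the paper's methodology, which measured the QsNet II hardware collectives directly; the iteration count is illustrative.

```c
/*
 * Hedged sketch: MPI-level timing of a barrier and an 8-byte floating-point
 * sum reduction, loosely mirroring the measurements quoted above.  The paper
 * measured the QsNet II hardware collectives directly; this is only an
 * approximation, and the iteration count is illustrative.
 */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Barrier latency: average over many back-to-back barriers. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    /* 64-bit floating-point sum across all ranks (an allreduce as a
     * stand-in for the reduction measured in the paper). */
    double in = rank, out;
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double reduce_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    if (rank == 0)
        printf("barrier: %.1f us, 8-byte sum allreduce: %.1f us\n",
               barrier_us, reduce_us);

    MPI_Finalize();
    return 0;
}
```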

    Shared Memory Programming on the Meiko CS-2

    An interesting feature of some recent parallel computers is that the transport mechanism underlying the currently dominant message-passing interfaces is based on a global address space model. By accessing this global address space directly, most of the inherent delays of administering message buffers and queues can be avoided. Using this interface, we have implemented a user-level distributed shared memory layer based on the virtual memory protection mechanisms of the operating system. The synchronisation required to maintain the coherency of the memory is addressed by implementing a distributed shared lock which exploits the remote atomic store operations provided by the Meiko CS-2. This allows an asynchronous style of programming where the load is dynamically distributed over the nodes of a parallel partition.
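
    The page-protection technique the abstract refers to is the classic mprotect/SIGSEGV mechanism. The sketch below shows only that generic technique; it is not the Meiko CS-2 implementation, and fetch_remote_page() is a hypothetical stand-in for the transfer the original layer performs over the CS-2 global address space.

```c
/*
 * Hedged sketch of the generic mprotect/SIGSEGV technique behind a user-level
 * DSM layer.  This is not the Meiko CS-2 implementation: fetch_remote_page()
 * is a hypothetical stand-in for the remote transfer, and a production DSM
 * layer must be far more careful (signal safety, write detection, coherence
 * protocol) than this single-page-in illustration.
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DSM_PAGES 16

static char  *dsm_base;
static size_t page_size;

/* Hypothetical: pull the current copy of the page from its home node. */
static void fetch_remote_page(void *page) {
    memset(page, 0, page_size);               /* stand-in for the real transfer */
}

static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < dsm_base || addr >= dsm_base + DSM_PAGES * page_size)
        abort();                               /* a genuine segfault, not ours */
    char *page = dsm_base + ((addr - dsm_base) / page_size) * page_size;
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* make the page accessible */
    fetch_remote_page(page);                   /* then the faulting access retries */
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    dsm_base = mmap(NULL, DSM_PAGES * page_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dsm_base == MAP_FAILED)
        return 1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = dsm_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    dsm_base[123] = 42;   /* faults, handler maps the page in, the store retries */
    printf("dsm_base[123] = %d\n", dsm_base[123]);
    return 0;
}
```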

    Not all applications have boring communication patterns: Profiling message matching with BMM

    Message matching within MPI is an important performance consideration for applications that use two-sided semantics. In this work, we present an instrumentation of the Cray MPI library that allows the collection of detailed message-matching statistics, as well as a software implementation of hashed matching. We use this functionality to profile key DOE applications with complex communication patterns to determine under what circumstances an application might benefit from hardware offload capabilities within the NIC to accelerate message matching. We find that several applications and libraries exhibit match lists long enough to motivate a Binned Message Matching approach.
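
    To illustrate how long match lists arise in the first place, the sketch below pre-posts many receives with distinct tags on one rank and then delivers messages in worst-case order, so each arriving message must be matched against a long posted-receive list. It is a generic MPI illustration, not the paper's BMM instrumentation; the request count is arbitrary.

```c
/*
 * Hedged sketch of how long posted-receive match lists arise (generic MPI
 * illustration, not the paper's BMM instrumentation; NREQ is arbitrary).
 * Rank 1 pre-posts one receive per tag; rank 0 sends in reverse tag order, so
 * early arrivals must be matched against (almost) the whole posted list, which
 * a classic MPI implementation traverses linearly.
 */
#include <mpi.h>
#include <stdio.h>

#define NREQ 4096

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static int buf[NREQ];
    static MPI_Request req[NREQ];

    if (rank == 1) {
        /* Build a long match list: one pending receive per tag. */
        for (int tag = 0; tag < NREQ; tag++)
            MPI_Irecv(&buf[tag], 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req[tag]);
        MPI_Waitall(NREQ, req, MPI_STATUSES_IGNORE);
        printf("matched %d pre-posted receives\n", NREQ);
    } else if (rank == 0) {
        /* Worst-case order: each message's tag sits near the end of the list. */
        for (int tag = NREQ - 1; tag >= 0; tag--) {
            buf[tag] = tag;
            MPI_Send(&buf[tag], 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```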

    General in-network processing – time is ripe!

    Remote Direct Memory Access (RDMA) networks have been around for more than a decade. RDMA hardware enables basic put/get operations into userland at very high speeds and reduces CPU overheads significantly. However, we observe that the CPU requirements for processing data at modern speeds of 400 or 800 Gbit/s are still huge. Modern smart NICs add various processing capabilities, ranging from fully-fledged ARM cores to FPGA-accelerated NICs. However, all current implementations are either relatively inefficient for line-rate packet processing or offer only limited functions such as header rewriting. We advocate a fully flexible model that allows arbitrary C code to be executed on each packet. We show that 'streaming Processing in the Network' (sPIN) enables such a model. Our implementation, based on RISC-V, demonstrates that generic network acceleration is feasible and delivers an efficiency improvement of up to 100x. We release our implementations as open source and expect that more vendors will adopt generic in-network computations in addition to RDMA.
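
    To give a feel for the "arbitrary C code on each packet" model, the sketch below is a host-side simulation of a per-packet handler that counts packets and accumulates a checksum over the payload. The types and entry point (pkt_t, msg_state_t, handle_packet) are hypothetical illustrations, not the actual sPIN handler API.

```c
/*
 * Hedged sketch: a host-side simulation of a per-packet handler in a
 * sPIN-like model.  pkt_t, msg_state_t and handle_packet() are hypothetical
 * illustrations, not the actual sPIN API; in sPIN such handlers run on
 * NIC-attached cores for every packet of a matched message.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const uint8_t *payload;   /* packet payload as delivered by the NIC */
    size_t         length;    /* payload length in bytes */
} pkt_t;

typedef struct {
    uint64_t packets;         /* per-message state shared by handler invocations */
    uint64_t checksum;
} msg_state_t;

/* Invoked once per arriving packet; returning 0 lets the payload be
 * deposited into host memory as usual. */
static int handle_packet(const pkt_t *pkt, msg_state_t *state) {
    uint64_t sum = 0;
    for (size_t i = 0; i < pkt->length; i++)
        sum += pkt->payload[i];
    /* On a real NIC these updates would need the device's atomics, since
     * several handler instances may run concurrently on different packets. */
    state->packets  += 1;
    state->checksum += sum;
    return 0;
}

int main(void) {
    /* Simulate one packet arriving. */
    uint8_t data[] = {1, 2, 3, 4};
    pkt_t pkt = { data, sizeof data };
    msg_state_t st = {0, 0};
    handle_packet(&pkt, &st);
    printf("packets=%llu checksum=%llu\n",
           (unsigned long long)st.packets, (unsigned long long)st.checksum);
    return 0;
}
```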